Making decisions by comparing two text documents is a novel idea. Text
documents contain details, rules and information related to a domain. The
judiciary is an area where many textual documents are available. Some
documents state rules related to the judiciary, such as the Indian penal code
(IPC) section documents, while others, such as the first information report
(FIR) and the investigation report, contain details of incidents. Our
assumption is that a system can support decision making by finding the right
IPC section from the text similarity between the IPC section document and the
FIR or investigation report. In this research paper, we introduce a new
research problem: suggesting appropriate IPC sections for crime-related
information in a user’s input by using the vector space model and natural
language processing techniques.

1. INTRODUCTION

A decision support system (DSS) is a computerized program used for decision-making activities
aimed at growing a business. With the progress in computing, new documents from many areas are
now being digitized. Documents related to the judicial system, such as first information reports
(FIRs), investigation reports, and judgments, are available digitally, and information can be
extracted from them by a computerized algorithm. In the past decade, some systems were developed
to help with decision making by using text similarity algorithms. These systems calculate the
similarity between two legal documents by using concept-based similarity, multi-dimensional
similarity [1] and embedding-based methodologies [2]-[4].

Developing a DSS that analyzes a report and finds the appropriate Indian penal code (IPC) sections
is a new idea. Whenever a crime occurs in society, it is reported to the police, and the police
investigate on the basis of that information. The police prepare a comprehensive report
(charge sheet) for the court, which mentions the IPC sections related to the crime. Knowledge and
experience of the IPC sections is required to prepare the charge sheet, so that a correct and
appropriate document is prepared for the court. Apart from the police, other people or
organizations can also be users of the system. One example is a lawyer who re-examines the charge
sheet, prepares the background of the crime based on his experience, and presents it in court on
behalf of the offender’s or the victim’s side. Reading and understanding documents manually is a
difficult and time-consuming task for everyone. If a computer program helps by highlighting
important information and checking the correctness of the result against the rules, it will help
the reader understand the document quickly. A common person or organization that has suffered any
crime, deception or violation of rights can also use this system. The person or organization has
to enter the details of the incident into the system.

To use the system, the user enters the information about the incident as natural language text;
after analyzing the incident, the system decides the section of the IPC. Here, we propose a DSS
for finding IPC sections (as an appropriate answer) for the user’s input. The applicable section
of the penal code depends on the situation, the circumstances, other information about the crime
and the definition given in the IPC document. Therefore, analysis of both the IPC document and
the input is necessary. A user may not use the exact words for an offence that appear in the penal
code document when writing an application, report or query; even then, our proposed system finds
penal code sections as an appropriate answer, along with related information for the user. Our
idea is to calculate the similarity between every sentence of the user’s input and the description
of every section of the IPC document. According to the similarity value, the system suggests a
list of the most appropriate IPC sections for the user’s input.

In earlier days, DSSs were developed for decision making in business, but today they are evolving
in many fields such as healthcare, security, medicine, manufacturing, and engineering. A large
body of literature is available on a variety of decision support systems. In recent years, various
legal information systems have been developed. Quaresma and Rodrigues proposed an approach based
on computational linguistic theory (syntactic analysis, semantic analysis and semantic
interpretation) to develop a question-answering system for juridical documents in the Portuguese
language. Query processing by information retrieval and analysis of documents by information
extraction are the two modules of this question answering system (QAS). The system contained a
complete set of decisions from several Portuguese juridical institutions [5]. Tirpude and Alvi
proposed a keyword-based question answering (QA) system for legal documents of Indian laws. For
this, the authors constructed the corpus and knowledge base from legal documents and prepared a
question dataset with answer types. The system suggested the answer to a query on the basis of a
keyword-indexed term dictionary [6]. Kamdi and Agrawal developed a question answering system for
IPC sections and Indian amendment laws. This QAS selects keywords and the question type from the
query and responds with the corresponding answer stored in the corpus. The authors state that the
problem lies at the intersection of two domains: information retrieval (IR) and natural language
processing (NLP) [7]. Sangeetha et al. proposed an information retrieval system designed to
retrieve relevant answers about laws. The user query was processed using natural language
processing techniques. The system was designed to handle dynamic queries from the user rather than
stored question-answer pairs [8].

Text processing is an essential part of every natural language based system. Various machine
learning approaches, such as decision trees, nearest neighbors, support vector machines (SVM),
sparse network of winnows, naive Bayes and log-linear models (maximum entropy models), have been
tried for the classification of text [8]-[10]. For part-of-speech tagging, named entity
identification and morphological analysis, rule-based techniques, the Google directory and hidden
Markov models were used [11]-[15]. For identifying and removing stop words from text, latent
semantic indexing (LSI), an SVM-based approach and deterministic finite automata (DFA) were
developed [16]-[18]. To address statement formation for systematic questions, a template-based
approach was proposed. This approach worked on domain-specific Wh-type questions and imperative
questions [19].

Calculating the text similarity between two different documents is the main task of our research.
Various approaches have been proposed by different authors for this work. Mihalcea et al. proposed
corpus-based and knowledge-based measures for the semantic similarity of short texts, which
exploit the information that can be drawn from the similarity of the component words
[20], [21]. The vector space model (VSM) is used for calculating the text similarity of short
sentences and paragraphs [22]-[25]. The graph-based text similarity (GBTS) algorithm maps Chinese
texts into graphs and then calculates the similarity of two texts by comparing their graphs [26].
Xue et al. presented a text similarity computing method for a clinical decision support system.
The authors improved the TF-IDF algorithm and the cosine similarity algorithm by combining them
with an eigenvector associated model to determine the case feature weights [27]. Duan and Xu
presented a short text similarity algorithm for finding similar police incidents. This algorithm
was developed from a novel semantic similarity algorithm, word mover’s distance (WMD) [28]. Jo
proposed a version of k-nearest neighbor (KNN) which considers similarity among attributes when
computing the similarity between feature vectors [29]. Alnajran et al. proposed a heuristic-driven
pre-processing methodology for enhancing the performance of similarity measures in the context of
Twitter tweets [30].


2. PROPOSED ARCHITECTURE OF SYSTEM 

Based on the rationale in the previous section, Figure 1 presents the architecture of the DSS for
finding the most suitable IPC section for the user’s input. In the first layer of the system, the
user input is analyzed using NLP techniques, and in the second layer a knowledge base for the IPC
section document is developed. The system consists of several components:
— Component for extracting offence words and crime-related information from the user’s input query.
— Components for analyzing crime-related information and the definitions of the selected IPC sections.
— Relevance matching component for the crime, according to the definitions of particular IPC sections.
— Component to retrieve and show the most appropriate IPC sections.


[Figure 1 block labels: Detail report of the crime; IPC sections with definition; Analyze offence
related information; Analyze selected IPC section]

Figure 1. Proposed architecture of system


3. METHOD 

The IPC document and the offence report are two different types of unstructured text. Developing a
system that determines the most appropriate IPC sections for a crime report from the unstructured
text of the IPC document is a difficult task. We identify the following steps to achieve our goal.

— Step 1: Develop a corpus for the IPC section document. The IPC document contains 511 sections
distributed over 23 chapters. Each chapter describes some kind of crime and its conditions. In the
corpus, each IPC section has four parts (IPC section number, root, offence and description of the
section).

— Step 2: Apply a method for calculating the text similarity between the input text and the
description of each IPC section. Semantic similarity is a measure of the conceptual distance
between two objects, based on the correspondence of their meanings [31].

The IPC section description text and the user input text are two different types of documents, and
there is very little chance that they are lexically similar. Our objective is to calculate the
semantic similarity between every sentence of the selected IPC section description text and every
sentence of the user’s input. To calculate the similarity, we follow these steps:

i) Apply pre-processing to the IPC section description text and the user’s input text. We used the
natural language toolkit (NLTK) to implement pre-processing. The steps are:

— Tokenization: Tokenization is the procedure of splitting a sentence into a list of words.
— Lower casing: Convert all words to a common case (preferably lower case), because in NLP the same
word in a different case is treated as a different word.
— Stop word removal: A text document contains many words (like ‘is’, ‘was’, ‘a’, and ‘the’) that
carry no importance for processing. These words must be removed from the document before
processing.
— Stemming/lemmatization: Stemming and lemmatization are processes that transform a word to its root
form. Lemmatization works better than stemming for converting a word to its root form.
— After cleaning the text, we obtain the most important words of the IPC section description and the
user’s input for further processing.

ii) Use the filtered words of the IPC section description as terms. Apply feature engineering to
represent the user’s input text as a vector over these terms: the vector value is calculated
according to the presence of each term, or of its synonyms, in the user’s input. Several
techniques can be applied to derive relevant features from a text document.
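
The pre-processing steps of item i) can be sketched in a few lines. This is only a minimal illustration using the Python standard library, not the NLTK-based implementation used in this work: the stop word set `STOP_WORDS` is a deliberately tiny, hypothetical list, and the `normalise` argument is an assumed hook where a stemmer or lemmatizer (for example NLTK's `WordNetLemmatizer`) could be plugged in.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real list (e.g. NLTK's) is much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "to", "or", "in", "which", "any"}


def preprocess(text, normalise=lambda word: word):
    """Tokenise, lower-case, drop stop words, then normalise each remaining word."""
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenization + lower casing
    kept = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [normalise(t) for t in kept]                 # stemming/lemmatization hook


words = preprocess("Public nuisance is an illegal omission which causes any common injury.")
print(words)
```

With the identity `normalise`, the word "causes" is kept as it is; a lemmatizer would be expected to reduce it to "cause", which is the form that appears in the filtered descriptions in section 3.1.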


3.1. Vector space model 

The vector space model is a matrix representation of a list of documents and a corpus of words.
Every row represents an individual document and every column represents a word of the corpus. Each
cell stores the value ‘0’ or ‘1’: ‘0’ means that the word is not present in the document and ‘1’
indicates that the word occurs in the document. In our problem, the vector matrix shows the
occurrence of terms (the selected features of a particular IPC section) in a text document (the
user’s input), and from the cell values we can calculate how strongly an IPC section appears in a
sentence. The user’s input may contain many sentences that are not related to the IPC section. If
the vector value of all the words in a sentence is ‘0’, the system ignores that sentence in the
score calculation. We create vectors for the description of each IPC section and for every
paragraph of the user’s input, and the system uses these vectors for further calculations. Several
tools are available for converting a text document into a vector.

i) CountVectorizer: CountVectorizer is a tool provided by the scikit-learn library in Python. It
transforms a given text into a vector on the basis of the frequency (count) of each word that
occurs in the entire text. Consider the following example of filtered IPC section descriptions:

— D0: public nuisance illegal omission cause common injury danger
— D1: unlawfully negligent act likely spread infection disease dangerous life
— D2: malignant act likely spread infection disease dangerous life

The sample result of CountVectorizer in Table 1 shows the frequency of words in each document
(D0, D1 and D2). In this example every word occurs at most once per document, so the frequency is
‘1’ if the word appears in the document and ‘0’ otherwise.

ii) TF-IDF: TF-IDF stands for term frequency-inverse document frequency. In this model, term
frequency and inverse document frequency are combined to decrease the weight of terms that appear
commonly in all the sentences. The step-wise formulas for calculating TF-IDF are:

— tf(t, d) = (count of t in d) / (number of words in d)   // term frequency
— df(t) = number of documents in which t occurs   // document frequency
— idf(t) = log(N / df(t)), where N is the total number of documents   // inverse document frequency
— tf-idf(t, d) = tf(t, d) * idf(t)

The sample result of TF-IDF in Table 2 shows the weight of words in each document (D0, D1 and D2).
The weight of each word is calculated from its appearance in the particular document and in all
documents.


Table 1. Sample IPC section vector using CountVectorizer

     act  cause  common  danger  public  spread  unlawfully
0    1    0      0       0       0       1       1
1    1    0      0       0       0       1       0
2    0    1      1       1       1       0       0

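
The binary term vectors of Table 1 can be reproduced without any library, which makes the representation explicit. The sketch below is an illustration under assumptions, not the code used in this work: the helper `binary_vector` and the variable names are hypothetical, and scikit-learn's `CountVectorizer` would normally build the same vocabulary automatically. Note that the rows of Table 1 appear to list the documents in a different order than the D0-D2 numbering used in the text.

```python
docs = {
    "D0": "public nuisance illegal omission cause common injury danger",
    "D1": "unlawfully negligent act likely spread infection disease dangerous life",
    "D2": "malignant act likely spread infection disease dangerous life",
}

# Vocabulary: every distinct word of the three descriptions, sorted alphabetically.
vocabulary = sorted({word for text in docs.values() for word in text.split()})


def binary_vector(text, vocab):
    """Return 1 for each vocabulary word present in the text, else 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocab]


vectors = {name: binary_vector(text, vocabulary) for name, text in docs.items()}

# Print only the seven columns that Table 1 displays.
shown = ["act", "cause", "common", "danger", "public", "spread", "unlawfully"]
for name, vec in vectors.items():
    print(name, [vec[vocabulary.index(term)] for term in shown])
```

The full vectors have one entry per vocabulary word (18 in this example); the table shows only a subset of the columns.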
Table 2. Sample IPC section vector using TF-IDF

     act       cause     common    danger    public    spread    unlawfully
0    0.309228  0         0         0         0         0.309228  0.406598
1    0.33847   0         0         0         0         0.33847   0
2    0         0.353553  0.353553  0.353553  0.353553  0         0
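
The weights in Table 2 can be checked by hand. The values appear to be consistent with the smoothed, length-normalised TF-IDF variant that scikit-learn's `TfidfVectorizer` applies by default, rather than with the plain formulas listed in item ii); this is an observation from recomputing the numbers, not something stated in the text. The sketch below assumes that variant: raw counts as tf, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and each row scaled to unit Euclidean length. The document order is chosen to follow the rows of Table 2, and the function names are hypothetical.

```python
import math

# Documents in the row order that Table 2 appears to use.
docs = [
    "unlawfully negligent act likely spread infection disease dangerous life",
    "malignant act likely spread infection disease dangerous life",
    "public nuisance illegal omission cause common injury danger",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)


def smoothed_idf(term):
    """ln((1 + N) / (1 + df)) + 1, the smoothed idf assumed in this sketch."""
    df = sum(term in doc for doc in tokenised)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Raw count times smoothed idf, scaled so that the row has unit Euclidean length."""
    weights = {t: doc.count(t) * smoothed_idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


rows = [tfidf_row(doc) for doc in tokenised]
print(round(rows[0]["act"], 6), round(rows[0]["unlawfully"], 6))
print(round(rows[1]["act"], 6), round(rows[2]["cause"], 6))
```

Words shared by two descriptions ("act", "spread") receive a lower weight than words unique to one description ("unlawfully"), which is the effect TF-IDF is intended to have.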


— Step 3: Calculate the cosine similarity between the vector of every paragraph of the user’s input
and the vector of each IPC section description. Cosine similarity measures the similarity between
two vectors of an inner product space, as shown in Figure 2. It is the cosine of the angle between
the two vectors and indicates whether they point in roughly the same direction. It is often used
to measure document similarity in text analysis. In general, values range between -1 and 1, where
-1 is perfectly dissimilar and 1 is perfectly similar; for non-negative count or TF-IDF vectors,
the value lies between 0 and 1.


$$
\text{Similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}
= \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}
$$


— Step 4: According to the calculated cosine similarity, the system shows a list of the most
appropriate IPC sections, that is, those most closely related to the user’s input. Here one
document is the description of an IPC section and the other document is a paragraph of the user’s
input.


[Figure 2 labels: Document 1; Document 2; Cosine Distance]

Figure 2. Cosine distance similarity
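
Steps 3 and 4 can be sketched together: compute the cosine similarity between one input vector and every section vector, then sort. The code below is a minimal illustration under assumptions; the function names, the `top_n` parameter and the toy input vector are hypothetical, and the section vectors reuse the seven columns of Table 1 for sections 268, 269 and 270. A zero vector is given a score of 0 so that unrelated text is ignored, in line with the rule stated in section 3.1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(input_vector, section_vectors, top_n=10):
    """Return (section, score) pairs by decreasing similarity, dropping zero scores."""
    scores = [(sec, cosine_similarity(input_vector, vec))
              for sec, vec in section_vectors.items()]
    scores = [(sec, s) for sec, s in scores if s > 0]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Columns: act, cause, common, danger, public, spread, unlawfully.
section_vectors = {
    268: [0, 1, 1, 1, 1, 0, 0],
    269: [1, 0, 0, 0, 0, 1, 1],
    270: [1, 0, 0, 0, 0, 1, 0],
}
user_paragraph = [1, 0, 0, 0, 0, 1, 0]  # hypothetical input mentioning "act" and "spread"
ranking = rank_sections(user_paragraph, section_vectors)
print(ranking)
```

In this toy case section 270 ranks first and section 269 second, while section 268 shares no term with the input and is left out of the list.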


4. RESULTS AND DISCUSSION 
4.1. Development of corpus 

The IPC document contains 511 sections, divided into 23 chapters. To test the validity of our
proposed work, we selected 4 chapters of the IPC document: chapters 14, 15, 16 and 22. We developed
a corpus for the sections (around 120) of these chapters, as shown in Table 3.
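
One possible in-memory layout for such a corpus is sketched below, with one record per section holding the four parts named in Step 1. The class name, field names and lookup dictionary are hypothetical choices made for illustration, and the three records are the rows shown in Table 3. The section number is kept as text on the assumption that some sections carry a letter suffix, such as 366B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPCSection:
    section_no: str   # text, not int, to allow suffixed numbers such as "366B"
    root: str         # root word of the offence
    offence: str      # short name of the offence
    description: str  # description text used for similarity calculation


corpus = [
    IPCSection("268", "nuisance", "Public nuisance",
               "Public nuisance, illegal omission which causes any common injury, danger"),
    IPCSection("269", "negligently", "Negligent act",
               "Unlawfully, negligent act likely to spread infection of disease dangerous to life"),
    IPCSection("270", "malignant", "Malignant act",
               "Malignant act likely to spread infection of disease dangerous to life"),
]

# Index the corpus by section number for direct lookup.
by_section = {entry.section_no: entry for entry in corpus}
print(by_section["269"].offence)
```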


4.2. Selection of complaint for input

We selected complaint texts related to these chapters, such as the one shown in Figure 3, as the
input query. These complaints are available in the form of FIRs on the official portal of the
state police in India. An FIR is divided into paragraphs that contain the offence and its related
information.


4.3. Similarity calculation 

The count vector and TF-IDF models were applied to calculate the text similarity between each
paragraph of the complaint and the description of each section, yielding a list of the ‘10’ IPC
sections most closely related to the complaint, as shown in Table 4. Both models produce a list of
IPC sections. The lists and their ordering differ between the two models, but most of the sections
related to the complaint are common to both. Based on the output of these models, the system can
act as decision support for the user.
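
The overlap between the two result lists can be quantified directly from the `related_ipcs_index` rows of Table 4. The short sketch below uses those indices exactly as printed (the TF-IDF row shows nine values); the variable names are hypothetical.

```python
# Corpus indices as printed in Table 4.
count_vector_top = [118, 42, 48, 41, 49, 51, 66, 26, 123, 43]
tfidf_top = [118, 42, 66, 26, 48, 123, 49, 41, 141]

common = [i for i in count_vector_top if i in tfidf_top]
only_count = [i for i in count_vector_top if i not in tfidf_top]
only_tfidf = [i for i in tfidf_top if i not in count_vector_top]

print(len(common), only_count, only_tfidf)
```

Eight indices appear in both lists, and the two models agree on the first two positions, which is consistent with the statement that most suggested sections are common to both models.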


Table 3. Corpus for IPC section document

Section  Root         Offence          Description
268      nuisance     Public nuisance  Public nuisance, illegal omission which causes any common injury, danger
269      negligently  Negligent act    Unlawfully, negligent act likely to spread infection of disease dangerous to life
270      malignant    Malignant act    Malignant act likely to spread infection of disease dangerous to life


Jayendra Saraswathi is the head of the Kanchi Mutt, one of a prominent Hindu monastic institutions
in the country. An investigative journalist named Dhanasekaran Prakash in the Tamil weekly
Nakkeeran alleged the reasons of the murder being the continuous infuriation by Sankararaman
against Jayendrar and Kanchi Mutt. ..........


Figure 3. Sample complaint text 


Table 4. Comparison of count vector and TF-IDF results

Count vector result
related_ipcs_index: [118 42 48 41 49 51 66 26 123 43]
('IPC', 364, ':', 'Kidnapping or abducting in order to murder')
('IPC', 303, ':', 'Punishment for murder by life-convict')
('IPC', 307, ':', 'Attempt to murder')
('IPC', 302, ':', 'Punishment for murder')
('IPC', 308, ':', 'Attempt to commit culpable homicide')
('IPC', 310, ':', 'Thug')
('IPC', '320F', ':', 'Grievous hurt')
('IPC', 290, ':', 'Punishment for public nuisance in cases not otherwise provided for')
('IPC', '366B', ':', 'Importation of girl from foreign country')
('IPC', 304, ':', 'Punishment for culpable homicide not amounting to murder')

TF-IDF result
related_ipcs_index: [118 42 66 26 48 123 49 41 141]
('IPC', 364, ':', 'Kidnapping or abducting in order to murder')
('IPC', 303, ':', 'Punishment for murder by life-convict')
('IPC', '320F', ':', 'Grievous hurt')
('IPC', 290, ':', 'Punishment for public nuisance in cases not otherwise provided for')
('IPC', 307, ':', 'Attempt to murder')
('IPC', '366B', ':', 'Importation of girl from foreign country')
('IPC', 308, ':', 'Attempt to commit culpable homicide')
('IPC', 302, ':', 'Punishment for murder')
('IPC', '376C', ':', 'Intercourse by superintendent of jail and remand home')
('IPC', 310, ':', 'Thug')

5. CONCLUSION

This research paper started by introducing a problem in the judicial system and proposed a solution
based on a decision support system (DSS). A DSS aims to help make the best decision based on
existing information. Over the past few decades, a number of information retrieval (IR) systems
and question answering systems (QAS) have been developed to find results and answers in a limited,
specific area. IR systems and QAS take a single-line question and apply NLP techniques to extract
keywords and search for results. Here we proposed the architecture of a DSS for crime incident
documents, which suggests a list of the most applicable IPC sections by comparing the user input
document and the IPC section document using the vector space model. Our proposed system extends
the working of a typical question answering system and helps the user take a decision on the basis
of the result. In the future, other text similarity algorithms such as word2vec, doc2vec, and BERT
(sentence transformers) will be used to check the accuracy of the system.